First, I plotted histograms of all the variables, to see what their individual distributions were like.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed acidity is fairly symmetrically distributed through the sample - the mean (6.855) and median (6.800) values are almost the same.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Volatile acidity is considerably lower than fixed acidity, and is slightly right-skewed in its distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid is again lower than fixed acidity, with a slight right-skew to the distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
Residual sugar is very right-skewed, and has an outlier that’s a long way out. Let’s look at a boxplot of residual sugar, as it’s easier to see outliers on boxplots.
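As a rough check on how far out that outlier is, the quartiles in the summary above can be fed into the usual 1.5 × IQR rule. This is a minimal sketch, assuming the boxplots use R's default whisker range of 1.5; the variable names are my own and the numbers are copied from the summary output.

```r
# Quartiles and maximum of residual sugar, copied from the summary above
q1 <- 1.7
q3 <- 9.9
max_value <- 65.8

# Tukey's rule: points more than 1.5 * IQR beyond the quartiles are flagged
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
lower_fence <- q1 - 1.5 * iqr

upper_fence               # about 22.2
max_value > upper_fence   # TRUE: the maximum lies far beyond the fence
```

The lower fence is negative, so under this rule no low value of residual sugar can be flagged; all the outliers sit on the right.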
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
Again, a strong right-skew. We’ll look at another boxplot.
A remarkable number of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Again, quite a number of outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
More outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
And yet more outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
This histogram looks a little bunched. We’ll look at this a little more closely.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
A nearly-normal distribution for the majority of the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Another nearly-normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
A completely irregular pattern for alcohol distribution.
This is the structure of the data,
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
and these are summaries of each variable in the data set.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
##
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
##
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
##
## alcohol quality
## Min. : 8.00 3: 20
## 1st Qu.: 9.50 4: 163
## Median :10.40 5:1457
## Mean :10.51 6:2198
## 3rd Qu.:11.40 7: 880
## Max. :14.20 8: 175
## 9: 5
The main features of interest in the white wine dataset are quality and alcohol. These are the two most likely reasons why people buy and drink wine in the first place.
Theory suggests that sulphates and residual sugar have strong influences on the quality and alcohol content of a wine. This theory is investigated below.
I changed the nature of quality from an integer to a factor. This is a table of the results.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
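The conversion can be sketched as follows. Since the data file itself is not shown here, this sketch rebuilds a stand-in quality vector from the counts in the table above; the names `counts`, `quality_int` and `quality` are mine, and the row order of the real data is not preserved.

```r
# Rebuild a stand-in quality column from the counts in the table above
# (only the counts match the real data, not the row order)
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
quality_int <- rep(3:9, times = counts)

# Convert the integer scores to a factor so each score is a category
quality <- factor(quality_int)

str(quality)
table(quality)
```

Treating quality as a factor means that plotting functions group by score instead of treating it as a continuous number.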
The histograms for residual sugar, chlorides and density seemed a little packed. As such, I created boxplots for residual sugar and chlorides, as these had long tails of outliers, while I adjusted the x-axis limits for the density distribution. pH has a nearly-normal distribution, with a median of 3.18 and a mean of 3.188. The rest of the variables are generally right-skewed, with the mean and median sitting towards the low end of each range.
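One crude way to put a number on that skew is to compare the gap between mean and median with the inter-quartile range. The sketch below is only a heuristic, not a formal skewness statistic, and it uses a handful of values copied from the summaries above; the data frame `stats` and the column `gap_in_iqr` are names I made up for illustration.

```r
# Quartiles, median and mean copied from the summary output above
stats <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "chlorides", "pH"),
  q1     = c(6.3, 1.7, 0.036, 3.09),
  median = c(6.8, 5.2, 0.043, 3.18),
  mean   = c(6.855, 6.391, 0.04577, 3.188),
  q3     = c(7.3, 9.9, 0.050, 3.28)
)

# A mean above the median hints at right skew; scale the gap by the IQR
stats$gap_in_iqr <- (stats$mean - stats$median) / (stats$q3 - stats$q1)

# Largest relative gap first
stats[order(-stats$gap_in_iqr), c("variable", "gap_in_iqr")]
```

On these four variables the gap is positive everywhere, smallest for pH and fixed acidity and largest for chlorides and residual sugar.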
First, we’ll look at variable histograms broken down by quality. Wines of Quality 9 have been excluded because there are only five of them in the entire dataset, and so small a sample can lead to very misleading impressions.
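Excluding one quality level might look something like this. It is a sketch on a stand-in data frame built from the quality counts reported earlier, not on the real file, and `wines` and `wines_sub` are hypothetical names.

```r
# Stand-in data frame holding only the quality column, built from the
# counts reported earlier (hypothetical name: wines)
counts <- c(20, 163, 1457, 2198, 880, 175, 5)
wines <- data.frame(quality = factor(rep(3:9, times = counts)))

# Keep every wine except those of quality 9, then drop the empty level
# so that it no longer appears as a blank group in plots
wines_sub <- droplevels(subset(wines, quality != "9"))

nrow(wines_sub)             # 4893 rows remain
levels(wines_sub$quality)   # "9" is no longer a level
```

Without `droplevels()`, the factor would still carry an empty "9" level, which would show up as an empty slot in grouped plots.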
Fixed acidity has the greatest variance among wines of Quality 3, but the distribution is more or less the same across the qualities.
Volatile Acidity shows about the same variance across the qualities, and the number of outliers is noticeable.
The citric acid distribution is one of the most compact in the dataset - the inter-quartile range is just 0.12, and the median lines nearly match each other across the different qualities of wine.
Residual sugar has some of the fewest outliers across the variables, with one of the larger inter-quartile ranges.
By contrast, chlorides shows an enormous number of outliers, with an inter-quartile range of just 0.014.
A compact distribution across the qualities.
A distribution with a greater variance than for free sulfur dioxide.
The median lines nearly match each other across the different qualities of wine.
Quite compact, with the exception of one large outlier in wines of quality 6.
Varied and relatively even distributions across the qualities.
This is the most informative of all the boxplots. It’s clear from the plot that the alcohol content of a wine in this dataset tends to increase with its quality.
A widely scattered graph (the Pearson’s Coefficient for residual sugar and alcohol is -0.451, a moderate negative correlation), but interesting for one outlier - there is one wine in the dataset that is almost absurdly sweeter than the rest. These are the details of that wine:
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782 7.8 0.965 0.6 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality
## 2782 0.69 11.7 6
It’s interesting to note that this is the same wine that showed as an outlier in the residual.sugar and density boxplots. Could there be a relationship between density and residual sugar?
Yes, there is. The Pearson’s Coefficient of Density v Residual Sugar in this dataset is 0.839. We can conclude that there is a strong positive correlation between density and residual sugar in the dataset.
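A coefficient like this can be computed with base R’s `cor()`. The sketch below uses only the first five residual.sugar and density values visible in the structure output earlier in this report, so its result is illustrative and will not match the 0.839 found on the full dataset.

```r
# First five rows of residual.sugar and density, as printed (rounded)
# in the structure output earlier in this report
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5)
density        <- c(1.001, 0.994, 0.995, 0.996, 0.996)

# Pearson's coefficient is the default method for cor()
r <- cor(residual_sugar, density, method = "pearson")
r
```

Even on these five rows the coefficient is strongly positive, which agrees in direction with the result for all 4898 wines.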
There is a strong negative correlation between density and alcohol content. The Pearson’s Coefficient is -0.78.
There is a positive correlation between total sulfur dioxide and free sulfur dioxide, with a Pearson’s Coefficient of 0.616.
The small sample size (five) for wines of Quality 9 distorted the results. As such, these five were removed to generate the graphs in this part of the report. Of the features of interest, it’s clear that there is a strong relationship between alcohol and quality. There is no case to be made for residual sugar or free sulfur dioxide content having any effect on a wine’s quality. The other variables are distributed in more or less the same way across the seven different qualities of wine in the dataset.
There is a strong positive correlation between density and residual sugar (Pearson’s Coefficient: 0.839), and a weaker correlation between total and free sulfur dioxide (Pearson’s Coefficient: 0.616).
The strongest relationship I found was that between density and residual sugar, a relationship which has a Pearson’s Coefficient of 0.839.
The strong positive correlation between density and residual sugar is consistent through the different qualities of wine.
The strong negative correlation between density and alcohol is also consistent through the different qualities of wine.
The correlation between free and total sulfur dioxide is not consistent across the different qualities of wine. It is most consistent for wines of quality 5 and 6, less so for the higher and lower quality wines.
Having discovered correlations between density and alcohol, density and residual sugar, and free and total sulfur dioxide, it made sense to look at these faceted by quality. The results repeated across the different qualities.
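Besides faceted plots, the same question can be asked numerically by computing the coefficient within each quality group. The sketch below runs on simulated data, not the wine dataset, purely to show the split-and-apply pattern; the data frame `sim`, its slope and its noise level are all assumptions of mine.

```r
set.seed(1)  # simulated data for illustration only, not the wine dataset

n <- 300
sim <- data.frame(
  quality = factor(sample(5:7, n, replace = TRUE)),
  residual.sugar = runif(n, min = 1, max = 20)
)
# Make density rise with sugar, plus noise, so a positive correlation exists
sim$density <- 0.99 + 0.0004 * sim$residual.sugar + rnorm(n, sd = 0.001)

# Split the rows by quality and compute Pearson's coefficient in each group
r_by_quality <- sapply(
  split(sim, sim$quality),
  function(g) cor(g$residual.sugar, g$density)
)
round(r_by_quality, 3)
```

If the coefficients are similar across groups, as they are by construction here, the relationship can be called consistent across qualities; widely different values would point to the kind of inconsistency seen for free and total sulfur dioxide.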
The greatest surprises, or perhaps clearest results, were in the bivariate analysis section. These multivariate plots served chiefly to reinforce what had gone before.
I did not create any model with the dataset. The only features of the dataset that could be modeled are the correlations between density and alcohol and between density and residual sugar. These relationships are only of interest to chemists, and chemists are probably already aware of them. They are of little interest to either the customer or the sommelier - the customer doesn’t want to study chemistry, and the sommelier knows that there’s far more to wine-making than statistical modelling.
This is the most informative plot in the dataset, clearly showing the relationship between alcohol content and wine quality. The six boxplots show the alcohol content dropping over wines of quality 3, 4 and 5 before rising steeply again in wines of quality 6, 7 and 8.
This graph plots density against alcohol for the sample data. The plot demonstrates a strong negative correlation between density and alcohol - they have a Pearson’s Coefficient of -0.78. The points on the scatter plot are set at alpha = 0.25 to reduce over-plotting. Some outliers have been removed to make the plot clearer.
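The two plotting choices, semi-transparent points and trimmed outliers, can be sketched in base R as below. The data are simulated, and trimming the axis to the 1st-99th percentile range is an assumption of mine: the exact plotting code and trimming rule used for the figure are not reproduced here.

```r
set.seed(2)  # simulated data for illustration only, not the wine dataset
alcohol <- runif(500, min = 8, max = 14)
density <- 1.005 - 0.001 * alcohol + rnorm(500, sd = 0.0008)
density[1] <- 1.039  # one extreme point, standing in for an outlier

# Limit the y-axis to the 1st-99th percentile range to hide extreme points
y_limits <- quantile(density, probs = c(0.01, 0.99))

pdf(NULL)  # null device so the sketch runs without a display; remove to view
plot(alcohol, density,
     pch = 16,
     col = adjustcolor("black", alpha.f = 0.25),  # 25% opacity per point
     ylim = y_limits,
     xlab = "Alcohol", ylab = "Density")
dev.off()
```

With 25% opacity, regions where many points overlap appear darker, so the dense core of the relationship stands out from the sparse edges.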
Residual sugar is plotted against density in a scatterplot, demonstrating a strong positive correlation between them - they have a Pearson’s Coefficient of 0.839. The points are colored according to their quality to add further information to the graph, and are set at alpha = 0.25 to reduce over-plotting. Some outliers have been removed to make the plot clearer.
The chief feature of my investigation was the relationship between quality and alcohol. A clear relationship was found using a boxplot of alcohol v quality – the higher the quality of wine in the dataset, the higher the alcohol content of that wine. The boxplot of alcohol distribution versus quality is the most informative plot in this report.
The strong positive correlation between density and residual sugar was an unexpected result of this investigation, discovered by investigating just one outlier value that repeated in both.
The other strong correlation in this dataset is between density and alcohol, which is not a factor when people are buying wine. While this is a disappointment to statisticians, it is almost certainly good news for sommeliers, who can be reassured that their specialty is indeed more art than science.
The distribution of qualities among the dataset, with many instances of medium-quality wines and relatively fewer instances of the lower and higher extremes, was unfortunate in this regard. Equal samples of all seven qualities would have been better suited to my own investigation.
The most obvious next step would be to look at the data for red wines, and compare these to this white wine dataset. However, the usefulness or otherwise of that comparison is dependent on what the researcher wants from the dataset.
There is a clear division between the attraction of this dataset to a chemist and to an oenophile. The strongest correlations in the dataset are of interest to chemists only, and of limited interest (or intelligibility) to the civilian population. An expansion of the data to red wines will be of interest to a chemist, but of little interest to non-chemists. While a vocation as a chemist doesn’t preclude someone being an oenophile, the true oenophile knows that the sommelier’s trade is always much more art than science.